Case Study 2

## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel

Data import

Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame':    870 obs. of  36 variables:
 $ ID                      : num  1 2 3 4 5 6 7 8 9 10 ...
 $ Age                     : num  32 40 35 32 24 27 41 37 34 34 ...
 $ Attrition               : chr  "No" "No" "No" "No" ...
 $ BusinessTravel          : chr  "Travel_Rarely" "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" ...
 $ DailyRate               : num  117 1308 200 801 567 ...
 $ Department              : chr  "Sales" "Research & Development" "Research & Development" "Sales" ...
 $ DistanceFromHome        : num  13 14 18 1 2 10 5 10 10 10 ...
 $ Education               : num  4 3 2 4 1 2 5 4 4 4 ...
 $ EducationField          : chr  "Life Sciences" "Medical" "Life Sciences" "Marketing" ...
 $ EmployeeCount           : num  1 1 1 1 1 1 1 1 1 1 ...
 $ EmployeeNumber          : num  859 1128 1412 2016 1646 ...
 $ EnvironmentSatisfaction : num  2 3 3 3 1 4 2 4 3 4 ...
 $ Gender                  : chr  "Male" "Male" "Male" "Female" ...
 $ HourlyRate              : num  73 44 60 48 32 32 90 88 87 92 ...
 $ JobInvolvement          : num  3 2 3 3 3 3 4 2 3 2 ...
 $ JobLevel                : num  2 5 3 3 1 3 1 2 1 2 ...
 $ JobRole                 : chr  "Sales Executive" "Research Director" "Manufacturing Director" "Sales Executive" ...
 $ JobSatisfaction         : num  4 3 4 4 4 1 3 4 3 3 ...
 $ MaritalStatus           : chr  "Divorced" "Single" "Single" "Married" ...
 $ MonthlyIncome           : num  4403 19626 9362 10422 3760 ...
 $ MonthlyRate             : num  9250 17544 19944 24032 17218 ...
 $ NumCompaniesWorked      : num  2 1 2 1 1 1 2 2 1 1 ...
 $ Over18                  : chr  "Y" "Y" "Y" "Y" ...
 $ OverTime                : chr  "No" "No" "No" "No" ...
 $ PercentSalaryHike       : num  11 14 11 19 13 21 12 14 19 14 ...
 $ PerformanceRating       : num  3 3 3 3 3 4 3 3 3 3 ...
 $ RelationshipSatisfaction: num  3 1 3 3 3 3 1 3 4 2 ...
 $ StandardHours           : num  80 80 80 80 80 80 80 80 80 80 ...
 $ StockOptionLevel        : num  1 0 0 2 0 2 0 3 1 1 ...
 $ TotalWorkingYears       : num  8 21 10 14 6 9 7 8 1 8 ...
 $ TrainingTimesLastYear   : num  3 2 2 3 2 4 5 5 2 3 ...
 $ WorkLifeBalance         : num  2 4 3 3 3 2 2 3 3 2 ...
 $ YearsAtCompany          : num  5 20 2 14 6 9 4 1 1 8 ...
 $ YearsInCurrentRole      : num  2 7 2 10 3 7 2 0 1 2 ...
 $ YearsSinceLastPromotion : num  0 4 2 5 1 1 0 0 0 7 ...
 $ YearsWithCurrManager    : num  3 9 2 7 3 7 3 0 0 7 ...
 - attr(*, "spec")=
  .. cols(
  ..   ID = col_double(),
  ..   Age = col_double(),
  ..   Attrition = col_character(),
  ..   BusinessTravel = col_character(),
  ..   DailyRate = col_double(),
  ..   Department = col_character(),
  ..   DistanceFromHome = col_double(),
  ..   Education = col_double(),
  ..   EducationField = col_character(),
  ..   EmployeeCount = col_double(),
  ..   EmployeeNumber = col_double(),
  ..   EnvironmentSatisfaction = col_double(),
  ..   Gender = col_character(),
  ..   HourlyRate = col_double(),
  ..   JobInvolvement = col_double(),
  ..   JobLevel = col_double(),
  ..   JobRole = col_character(),
  ..   JobSatisfaction = col_double(),
  ..   MaritalStatus = col_character(),
  ..   MonthlyIncome = col_double(),
  ..   MonthlyRate = col_double(),
  ..   NumCompaniesWorked = col_double(),
  ..   Over18 = col_character(),
  ..   OverTime = col_character(),
  ..   PercentSalaryHike = col_double(),
  ..   PerformanceRating = col_double(),
  ..   RelationshipSatisfaction = col_double(),
  ..   StandardHours = col_double(),
  ..   StockOptionLevel = col_double(),
  ..   TotalWorkingYears = col_double(),
  ..   TrainingTimesLastYear = col_double(),
  ..   WorkLifeBalance = col_double(),
  ..   YearsAtCompany = col_double(),
  ..   YearsInCurrentRole = col_double(),
  ..   YearsSinceLastPromotion = col_double(),
  ..   YearsWithCurrManager = col_double()
  .. )

The data set has 870 observations and 36 variables for us to work with.

Pre-Processing

Here I am doing some processing of the data. I am going to convert some character columns to factors to make modeling easier. I am also going to add variables to this step as I proceed with my analysis

colsToFactor <- c("Attrition", "BusinessTravel", "Department", "EducationField", 
    "Gender", "JobRole", "MaritalStatus", "Over18", "OverTime", "StockOptionLevel", 
    "JobLevel", "JobInvolvement", "Education", "EnvironmentSatisfaction", "JobSatisfaction", 
    "RelationshipSatisfaction", "WorkLifeBalance", "PerformanceRating")
# Consider 'StockOptionLevel','JobLevel','JobInvolvement','Education'; They
# could be coded as numeric
casedata[, colsToFactor] <- lapply(casedata[, colsToFactor], as.factor)

casedata$logMonthlyIncome <- log(casedata$MonthlyIncome)

casedata$IncomeLt4000 <- ifelse(casedata$MonthlyIncome <= 4000, 1, 0)
casedata$DistHomeFactor <- cut(casedata$DistanceFromHome, c(0, 10, 20, 30), 
    labels = c("Close", "Medium", "Far"), include.lowest = TRUE)
casedata$AgeGroup <- cut(casedata$Age, c(18, 25, 35, 45, 60), labels = c("18-25", 
    "25-35", "35-45", "45-60"), include.lowest = TRUE)
casedata$NumCompCat <- cut(casedata$NumCompaniesWorked, c(0, 2, 6, 9), labels = c("0-2", 
    "2-6", "6-9"), include.lowest = TRUE)
casedata$WorkingYearsGroup <- cut(casedata$TotalWorkingYears, c(0, 5, 10, 15, 
    20, 40), labels = c("0-5", "5-10", "10-15", "15-20", "20-40"), include.lowest = TRUE)
casedata$RoleYearsGroup <- cut(casedata$YearsInCurrentRole, c(0, 3, 6, 10, 20), 
    labels = c("0-3", "3-6", "6-10", "10+"), include.lowest = TRUE)
casedata$CompanyYearsGroup <- cut(casedata$YearsAtCompany, c(0, 3, 10, 20, 40), 
    labels = c("0-3", "3-10", "10-20", "20-40"), include.lowest = TRUE)
casedata$IncomeGroup <- cut(casedata$MonthlyIncome, c(0, 4000, 8000, 12000, 
    16000, 20000), labels = c("<$4K", "$4K - $8K", "$8K-$12K", "$12K-$16K", 
    "$16K-$20K"), include.lowest = TRUE)

vars <- c("MonthlyIncome", "Age", "TotalWorkingYears", "YearsAtCompany", "YearsInCurrentRole", 
    "YearsWithCurrManager", "YearsInCurrentRole", "YearsSinceLastPromotion")

# This is an issue because of zeroes
casedata <- casedata %>% mutate_at(vars, list(log = log))

Exploratory Analysis

In this section I am going to look mostly at individual variables to try to get a sense of what each variable means and whether or not I should consider it for further analysis and modeling.

Attrition

Attrition Count Proportion
No 730 83.9%
Yes 140 16.1%

Out of the 870 employees, 140 left their jobs, which is 16% attrition

Business Travel

BusinessTravel Count Proportion
Non-Travel 94 10.8%
Travel_Frequently 158 18.2%
Travel_Rarely 618 71.0%

681 employees or 71% travel rarely. It seems like the most frequent travelers have the highest attrition rates, and non-travelers have the lowest. Non-travelers have a lower 75th percentile on income, Frequent and non-frequent travelers have similar pay.

Department

Department Count Proportion
Human Resources 35 4.02%
Research & Development 562 64.6%
Sales 273 31.4%

The sales department has the highest attrition rate and R&D has the lowest attrition rate. Mean income is fairly similar but their is a clear difference in medians with HR having the lowest median pay and Sales having the highest median pay.

Education

Education Count Proportion
1 98 11.3%
2 182 20.9%
3 324 37.2%
4 240 27.6%
5 26 2.99%

Assuming higher values of Education mean more advance education, better educated employees seem to have lower attrtion rates. 1-3 are similar but 4 & 5 have a clear decrease. Education 5 also has a very clear pay advantage.

EducationField Count Proportion
Human Resources 15 1.72%
Life Sciences 358 41.1%
Marketing 100 11.5%
Medical 270 31.0%
Other 52 5.98%
Technical Degree 75 8.62%

Those with education in HR have the highest attrition rates in this sample. Education in Human Resources seems to come with lower median pay, but those educated in marketing have higher median pay

Environment Satisfaction

EnvironmentSatisfaction Count Proportion
1 172 19.8%
2 178 20.5%
3 258 29.7%
4 262 30.1%

Assuming lower is less satisfied, those least satisfied with their enviroment have the highest attrition rates. Mean and Median pay is similar between groups here, but the 75th percentile is lower for groups 2&4. Not sure how this is meaningful to pay overall.

Gender

Gender Count Proportion
Female 354 40.7%
Male 516 59.3%

Attrition rates are similar between genders, with Men having just a slightly higher rate. In terms of pay, Men are making a little less than average in this sample.

Job Involvement

JobInvolvement Count Proportion
1 47 5.40%
2 228 26.2%
3 514 59.1%
4 81 9.31%

Those with lower job involvement have much higher attrition. The relationship with pay is much less clear.

Job Level

JobLevel Count Proportion
1 329 37.8%
2 312 35.9%
3 132 15.2%
4 60 6.90%
5 37 4.25%

Job level 1 has a much higher attrition rate. Job level also seems very linearly associated with pay.

Job Role

JobRole Count Proportion
Healthcare Representative 76 8.74%
Human Resources 27 3.10%
Laboratory Technician 153 17.6%
Manager 51 5.86%
Manufacturing Director 87 10.0%
Research Director 51 5.86%
Research Scientist 172 19.8%
Sales Executive 200 23.0%
Sales Representative 53 6.09%

Sales Representatives have the highest attrition rate by far. Managers and Research Directors have the highest pay and low attrition rates.

Marital Status

MaritalStatus Count Proportion
Divorced 191 22.0%
Married 410 47.1%
Single 269 30.9%

Single employees have the highest attrition and below average pay.

Num Companies Worked

I would like to think this is the number of companies worked before this one because the minimum is zero, so those in that category would have only worked at this company. Otherwise I don’t know how zero makes sense unless we are counting employees who have been hired but not started.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   1.000   2.000   2.728   4.000   9.000 

It seems like those that attrite have worked at fewer companies?

Over Time

OverTime Count Proportion
No 618 71.0%
Yes 252 29.0%

Overtime eligible employees have much higher attrition rates and slightly lower monthly income.

Stock Option Level

StockOptionLevel Count Proportion
0 379 43.6%
1 355 40.8%
2 81 9.31%
3 55 6.32%

Most employees fall in Stock Option level 0 or 1, with much fewer in 2 & 3. Stock option levels 0 & 3 have the highest attrition, with 1 & 2 being much lower. Perhaps 1 & 2 have more long term incentives to keep employees at the firm. 1 & 2 also have higher median / average pay.

Daily, Hourly, and Monthly Rates

I am not seeing any meaningful relationship to attrition or Monthly Income for Daily Rate

I am not seeing any meaningful relationship to attrition or Monthly Income for Hourly Rate, and Hourly Rate doesn’t seem related to Daily Rate. I would normally assoicate Hourly/Daily rates with income but in this case either this is not true or somehow doesn’t translate to monthly income. I will probably ignore these.

Other Variables

  • ID
    • EmployeeNumber
    • EmployeeCount
    • Over18
    • StandardHours

These variables are not particularly useful for analysis because they either are individual IDs or do not vary. Every row is one employee, every employee is over 18, and every employee has standard hours of 80. 18 is a standard working age, 80 hours would seem high if it was a weekly number but don’t have any information about it other than it is constant.

Additional Analysis

Performance and Salary

There are only two ratings, I wanted confirmation that 4 is a better rating, and this plot shows those that have a rating of 4 get larger salary increases.

Variable Importance

Trying to get an idea of what variables might be important for modeling

                                          No         Yes
Age                               6.52185425  6.52185425
BusinessTravelTravel_Frequently  -0.26541400 -0.26541400
BusinessTravelTravel_Rarely       0.46717934  0.46717934
DepartmentResearch & Development  1.19310689  1.19310689
DepartmentSales                   2.55332552  2.55332552
DistanceFromHome                  1.98471359  1.98471359
Education2                       -1.06722009 -1.06722009
Education3                       -1.40306383 -1.40306383
Education4                       -0.01705006 -0.01705006
Education5                       -1.88722037 -1.88722037
EducationFieldLife Sciences       0.30416695  0.30416695
EducationFieldMarketing           4.71069204  4.71069204
EducationFieldMedical             1.04794352  1.04794352
EducationFieldOther              -2.80707947 -2.80707947
EducationFieldTechnical Degree   -0.56185017 -0.56185017
EnvironmentSatisfaction2         -0.88218182 -0.88218182
EnvironmentSatisfaction3         -0.51665115 -0.51665115
EnvironmentSatisfaction4          2.27915035  2.27915035
GenderMale                       -0.82319377 -0.82319377
JobInvolvement2                  -2.00022141 -2.00022141
JobInvolvement3                   0.85106244  0.85106244
JobInvolvement4                  -0.10973691 -0.10973691
JobLevel2                         1.44956652  1.44956652
JobLevel3                        -0.40884542 -0.40884542
JobLevel4                        -0.23505241 -0.23505241
JobLevel5                        -1.19875153 -1.19875153
JobRoleHuman Resources           -0.29187424 -0.29187424
JobRoleLaboratory Technician      1.59762428  1.59762428
JobRoleManager                   -0.29298418 -0.29298418
JobRoleManufacturing Director     0.75891419  0.75891419
JobRoleResearch Director          1.65859574  1.65859574
JobRoleResearch Scientist         2.99278442  2.99278442
JobRoleSales Executive           -0.85877930 -0.85877930
JobRoleSales Representative       1.90627863  1.90627863
JobSatisfaction2                 -0.50095252 -0.50095252
JobSatisfaction3                 -0.04278634 -0.04278634
JobSatisfaction4                  2.64712819  2.64712819
 [ reached 'max' / getOption("max.print") -- omitted 22 rows ]

rf variable importance

  only 20 most important variables shown (out of 59)

                              Overall
JobLevel3                     100.000
JobLevel2                      86.050
TotalWorkingYears              44.842
JobRoleResearch Director       35.647
JobLevel4                      31.699
JobLevel5                      24.098
JobRoleManager                 22.089
JobRoleLaboratory Technician   22.035
JobRoleSales Executive         18.454
YearsAtCompany                 15.715
YearsInCurrentRole             14.394
JobRoleResearch Scientist      13.504
JobRoleManufacturing Director  12.843
Age                            12.065
JobRoleSales Representative    10.289
DepartmentSales                10.162
BusinessTravelTravel_Rarely     9.984
Education5                      9.708
NumCompaniesWorked              9.325
YearsWithCurrManager            9.218
rf variable importance

  only 20 most important variables shown (out of 59)

                              Overall
JobLevel2                     100.000
JobLevel3                      78.826
TotalWorkingYears              59.557
JobRoleResearch Director       27.147
JobLevel4                      25.509
JobRoleLaboratory Technician   25.354
JobRoleSales Executive         19.897
JobRoleManager                 17.915
JobLevel5                      17.901
YearsAtCompany                 16.193
JobRoleResearch Scientist      15.634
Age                            13.245
YearsInCurrentRole             13.118
JobRoleManufacturing Director  12.688
YearsWithCurrManager           12.278
YearsSinceLastPromotion        12.185
DepartmentSales                 9.979
JobRoleSales Representative     9.898
JobSatisfaction2                9.486
BusinessTravelTravel_Rarely     9.188

Linear Model

# A tibble: 60 x 5
   term                               estimate std.error statistic  p.value
   <chr>                                 <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)                         3026.      827.       3.66  0.000272
 2 Age                                   -4.13      6.05    -0.683 0.495   
 3 AttritionYes                         -54.4     124.      -0.438 0.661   
 4 BusinessTravelTravel_Frequently       50.2     149.       0.338 0.736   
 5 BusinessTravelTravel_Rarely          321.      126.       2.55  0.0110  
 6 `DepartmentResearch & Development`   344.      574.       0.600 0.549   
 7 DepartmentSales                     -190.      581.      -0.328 0.743   
 8 DistanceFromHome                      -3.88      4.78    -0.813 0.416   
 9 Education2                            17.9     142.       0.126 0.900   
10 Education3                           -20.2     131.      -0.154 0.878   
# ... with 50 more rows
[1] 1161.022
# A tibble: 60 x 5
   term                              estimate std.error statistic   p.value
   <chr>                                <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)                        8.01e+0   0.178     45.0    3.49e-200
 2 Age                                3.90e-5   0.00130    0.0299 9.76e-  1
 3 AttritionYes                      -5.93e-2   0.0267    -2.22   2.70e-  2
 4 BusinessTravelTravel_Frequently    3.06e-2   0.0320     0.955  3.40e-  1
 5 BusinessTravelTravel_Rarely        7.39e-2   0.0271     2.73   6.56e-  3
 6 `DepartmentResearch & Developme~   5.13e-2   0.124      0.415  6.78e-  1
 7 DepartmentSales                   -1.05e-2   0.125     -0.0842 9.33e-  1
 8 DistanceFromHome                  -4.83e-4   0.00103   -0.470  6.39e-  1
 9 Education2                        -3.74e-3   0.0306    -0.122  9.03e-  1
10 Education3                        -2.18e-2   0.0282    -0.771  4.41e-  1
# ... with 50 more rows
[1] 1206.878
# A tibble: 14 x 5
   term                            estimate std.error statistic p.value
   <chr>                              <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept)                      3471.      197.     17.6    0      
 2 `JobRoleHuman Resources`         -926.      298.     -3.11   0.00197
 3 `JobRoleLaboratory Technician`  -1134.      190.     -5.96   0      
 4 JobRoleManager                   3184.      256.     12.5    0      
 5 `JobRoleManufacturing Director`   130.      172.      0.754  0.451  
 6 `JobRoleResearch Director`       3368.      224.     15.0    0      
 7 `JobRoleResearch Scientist`      -900.      193.     -4.66   0      
 8 `JobRoleSales Executive`           -7.16    148.     -0.0482 0.962  
 9 `JobRoleSales Representative`   -1122.      236.     -4.75   0      
10 JobLevel2                        1764.      151.     11.7    0      
11 JobLevel3                        5070.      207.     24.5    0      
12 JobLevel4                        8458.      307.     27.6    0      
13 JobLevel5                       11185.      362.     30.9    0      
14 TotalWorkingYears                  48.2       8.71    5.54   0      
[1] 1105.254
# A tibble: 14 x 5
   term                            estimate std.error statistic  p.value
   <chr>                              <dbl>     <dbl>     <dbl>    <dbl>
 1 (Intercept)                      8.02      0.0424   189.     0       
 2 `JobRoleHuman Resources`        -0.176     0.0642    -2.75   0.00617 
 3 `JobRoleLaboratory Technician`  -0.227     0.0410    -5.53   0       
 4 JobRoleManager                   0.260     0.0550     4.73   0       
 5 `JobRoleManufacturing Director`  0.0144    0.0370     0.390  0.697   
 6 `JobRoleResearch Director`       0.289     0.0483     5.98   0       
 7 `JobRoleResearch Scientist`     -0.167     0.0416    -4.01   0.000070
 8 `JobRoleSales Executive`        -0.00218   0.0319    -0.0682 0.946   
 9 `JobRoleSales Representative`   -0.267     0.0509    -5.25   0       
10 JobLevel2                        0.504     0.0325    15.5    0       
11 JobLevel3                        0.956     0.0445    21.5    0       
12 JobLevel4                        1.16      0.0660    17.5    0       
13 JobLevel5                        1.29      0.0779    16.6    0       
14 TotalWorkingYears                0.0106    0.00187    5.64   0       
[1] 1149.049
# A tibble: 25 x 5
   term                            estimate std.error statistic p.value
   <chr>                              <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept)                       2891.       612.     4.72    0    
 2 `JobRoleHuman Resources`          -590.       620.    -0.952   0.341
 3 `JobRoleLaboratory Technician`   -1126.       189.    -5.95    0    
 4 JobRoleManager                    3414.       299.    11.4     0    
 5 `JobRoleManufacturing Director`     78.1      171.     0.456   0.648
 6 `JobRoleResearch Director`        3342.       226.    14.8     0    
 7 `JobRoleResearch Scientist`       -890.       192.    -4.63    0    
 8 `JobRoleSales Executive`           528.       350.     1.51    0.132
 9 `JobRoleSales Representative`     -547.       395.    -1.38    0.167
10 JobLevel2                         1795.       153.    11.7     0    
# ... with 15 more rows
[1] 1122.661
# A tibble: 16 x 5
   term                            estimate std.error statistic p.value
   <chr>                              <dbl>     <dbl>     <dbl>   <dbl>
 1 (Intercept)                       3193.     221.      14.5   0      
 2 `JobRoleHuman Resources`          -930.     295.      -3.15  0.00171
 3 `JobRoleLaboratory Technician`   -1133.     189.      -6.00  0      
 4 JobRoleManager                    3161.     254.      12.4   0      
 5 `JobRoleManufacturing Director`     90.2    171.       0.528 0.598  
 6 `JobRoleResearch Director`        3343.     223.      15.0   0      
 7 `JobRoleResearch Scientist`       -891.     191.      -4.65  0      
 8 `JobRoleSales Executive`           -25.8    147.      -0.175 0.861  
 9 `JobRoleSales Representative`    -1100.     235.      -4.68  0      
10 JobLevel2                         1802.     150.      12.0   0      
11 JobLevel3                         5113.     206.      24.9   0      
12 JobLevel4                         8499.     305.      27.9   0      
13 JobLevel5                        11163.     359.      31.1   0      
14 TotalWorkingYears                   48.3      8.64     5.59  0      
15 BusinessTravelTravel_Frequently     71.4    144.       0.496 0.620  
16 BusinessTravelTravel_Rarely        352.     122.       2.89  0.00402
[1] 1120.712
rf variable importance

  only 20 most important variables shown (out of 24)

                                Overall
JobLevel2                       100.000
JobLevel3                        95.225
TotalWorkingYears                51.181
JobRoleResearch Director         39.175
JobLevel4                        29.971
JobRoleSales Executive           24.835
YearsAtCompany                   23.536
JobRoleManager                   22.585
JobRoleLaboratory Technician     21.594
YearsInCurrentRole               20.869
BusinessTravelTravel_Rarely      17.844
JobLevel5                        17.585
JobRoleManufacturing Director    15.224
JobRoleSales Representative      10.796
JobRoleResearch Scientist        10.669
Education2                       10.218
Age                               9.877
DepartmentSales                   9.393
BusinessTravelTravel_Frequently   7.707
Education5                        5.336
[1] 1088.056
rf variable importance

                               Overall
JobLevel2                     100.0000
JobLevel3                      91.8112
TotalWorkingYears              43.0409
JobRoleLaboratory Technician   29.4250
JobRoleResearch Director       27.2399
JobLevel4                      22.3694
JobRoleManager                 13.4035
JobRoleManufacturing Director  11.4310
JobRoleSales Executive          6.3037
JobRoleResearch Scientist       5.8021
JobLevel5                       5.5704
JobRoleSales Representative     0.1003
JobRoleHuman Resources          0.0000
[1] 1071.911

Naive Bayes

Lets quick try a model with everything in it.

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  86   6
       Yes 59  21
                                          
               Accuracy : 0.6221          
                 95% CI : (0.5451, 0.6948)
    No Information Rate : 0.843           
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.2062          
                                          
 Mcnemar's Test P-Value : 1.12e-10        
                                          
            Sensitivity : 0.7778          
            Specificity : 0.5931          
         Pos Pred Value : 0.2625          
         Neg Pred Value : 0.9348          
             Prevalence : 0.1570          
         Detection Rate : 0.1221          
   Detection Prevalence : 0.4651          
      Balanced Accuracy : 0.6854          
                                          
       'Positive' Class : Yes             
                                          
[1] 0.7257384

Not too bad. Specificity is just under the desired threshold.

Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  126  14
       Yes  19  13
                                          
               Accuracy : 0.8081          
                 95% CI : (0.7412, 0.8641)
    No Information Rate : 0.843           
    P-Value [Acc > NIR] : 0.9107          
                                          
                  Kappa : 0.3259          
                                          
 Mcnemar's Test P-Value : 0.4862          
                                          
            Sensitivity : 0.48148         
            Specificity : 0.86897         
         Pos Pred Value : 0.40625         
         Neg Pred Value : 0.90000         
             Prevalence : 0.15698         
         Detection Rate : 0.07558         
   Detection Prevalence : 0.18605         
      Balanced Accuracy : 0.67522         
                                          
       'Positive' Class : Yes             
                                          
[1] 0.8842105
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  101  10
       Yes  44  17
                                         
               Accuracy : 0.686          
                 95% CI : (0.611, 0.7545)
    No Information Rate : 0.843          
    P-Value [Acc > NIR] : 1              
                                         
                  Kappa : 0.2157         
                                         
 Mcnemar's Test P-Value : 7.098e-06      
                                         
            Sensitivity : 0.62963        
            Specificity : 0.69655        
         Pos Pred Value : 0.27869        
         Neg Pred Value : 0.90991        
             Prevalence : 0.15698        
         Detection Rate : 0.09884        
   Detection Prevalence : 0.35465        
      Balanced Accuracy : 0.66309        
                                         
       'Positive' Class : Yes            
                                         
[1] 0.7890625
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  107  11
       Yes  38  16
                                          
               Accuracy : 0.7151          
                 95% CI : (0.6414, 0.7812)
    No Information Rate : 0.843           
    P-Value [Acc > NIR] : 0.9999930       
                                          
                  Kappa : 0.2349          
                                          
 Mcnemar's Test P-Value : 0.0002038       
                                          
            Sensitivity : 0.59259         
            Specificity : 0.73793         
         Pos Pred Value : 0.29630         
         Neg Pred Value : 0.90678         
             Prevalence : 0.15698         
         Detection Rate : 0.09302         
   Detection Prevalence : 0.31395         
      Balanced Accuracy : 0.66526         
                                          
       'Positive' Class : Yes             
                                          
[1] 0.8136882
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  106  12
       Yes  39  15
                                          
               Accuracy : 0.7035          
                 95% CI : (0.6292, 0.7706)
    No Information Rate : 0.843           
    P-Value [Acc > NIR] : 0.9999987       
                                          
                  Kappa : 0.2037          
                                          
 Mcnemar's Test P-Value : 0.0002719       
                                          
            Sensitivity : 0.55556         
            Specificity : 0.73103         
         Pos Pred Value : 0.27778         
         Neg Pred Value : 0.89831         
             Prevalence : 0.15698         
         Detection Rate : 0.08721         
   Detection Prevalence : 0.31395         
      Balanced Accuracy : 0.64330         
                                          
       'Positive' Class : Yes             
                                          
[1] 0.8060837
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  100   8
       Yes  45  19
                                          
               Accuracy : 0.6919          
                 95% CI : (0.6171, 0.7599)
    No Information Rate : 0.843           
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.2525          
                                          
 Mcnemar's Test P-Value : 7.615e-07       
                                          
            Sensitivity : 0.7037          
            Specificity : 0.6897          
         Pos Pred Value : 0.2969          
         Neg Pred Value : 0.9259          
             Prevalence : 0.1570          
         Detection Rate : 0.1105          
   Detection Prevalence : 0.3721          
      Balanced Accuracy : 0.6967          
                                          
       'Positive' Class : Yes             
                                          
[1] 0.7905138

Random Forest (Attrition)

rf variable importance

  only 20 most important variables shown (out of 39)

                             Importance
OverTimeYes                      100.00
MonthlyIncome                     91.02
YearsInCurrentRole                63.57
TotalWorkingYears                 62.52
YearsWithCurrManager              58.07
Age                               54.53
EducationFieldMarketing           45.55
MaritalStatusSingle               45.44
JobRoleResearch Scientist         41.00
JobRoleLaboratory Technician      40.44
StockOptionLevel1                 40.23
YearsAtCompany                    40.23
DepartmentSales                   40.03
JobSatisfaction4                  38.06
JobRoleResearch Director          37.36
JobRoleSales Representative       36.76
JobSatisfaction2                  36.07
EnvironmentSatisfaction4          33.13
RelationshipSatisfaction3         30.64
NumCompaniesWorked                28.72
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  136  21
       Yes   9   6
                                          
               Accuracy : 0.8256          
                 95% CI : (0.7605, 0.8791)
    No Information Rate : 0.843           
    P-Value [Acc > NIR] : 0.77179         
                                          
                  Kappa : 0.1955          
                                          
 Mcnemar's Test P-Value : 0.04461         
                                          
            Sensitivity : 0.9379          
            Specificity : 0.2222          
         Pos Pred Value : 0.8662          
         Neg Pred Value : 0.4000          
             Prevalence : 0.8430          
         Detection Rate : 0.7907          
   Detection Prevalence : 0.9128          
      Balanced Accuracy : 0.5801          
                                          
       'Positive' Class : No              
                                          
[1] 0.9006623
rf variable importance

  only 20 most important variables shown (out of 37)

                                 Importance
OverTimeYes                          100.00
MonthlyIncome                         94.61
TotalWorkingYears                     47.87
JobRoleResearch Scientist             46.75
MaritalStatusSingle                   46.68
StockOptionLevel1                     46.42
JobRoleSales Representative           38.65
EducationFieldMarketing               33.89
NumCompaniesWorked                    31.00
PerformanceRating4                    23.15
RelationshipSatisfaction2             21.19
JobSatisfaction4                      20.74
Age                                   19.77
MaritalStatusMarried                  19.05
EducationFieldMedical                 19.00
DepartmentResearch & Development      18.85
YearsInCurrentRole                    18.37
StockOptionLevel2                     18.06
DepartmentSales                       17.72
JobRoleManager                        16.78
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  137  21
       Yes   8   6
                                          
               Accuracy : 0.8314          
                 95% CI : (0.7669, 0.8841)
    No Information Rate : 0.843           
    P-Value [Acc > NIR] : 0.70582         
                                          
                  Kappa : 0.2078          
                                          
 Mcnemar's Test P-Value : 0.02586         
                                          
            Sensitivity : 0.9448          
            Specificity : 0.2222          
         Pos Pred Value : 0.8671          
         Neg Pred Value : 0.4286          
             Prevalence : 0.8430          
         Detection Rate : 0.7965          
   Detection Prevalence : 0.9186          
      Balanced Accuracy : 0.5835          
                                          
       'Positive' Class : No              
                                          
[1] 0.9042904

William Arnost

2019-08-18